In [1]:
from IPython.display import Image, HTML, display, YouTubeVideo # IPython rich display Image.
# Ignoring deprecation warning messages.
import warnings
warnings.filterwarnings('ignore')
In [10]:
Image('/home/raul/Ga_Tech/dotCom/dotComLOGO.png')
Out[10]:
Data analysis is now critical to business strategy. Businesses are increasingly driven by data analytics, so there is a great professional advantage in being able to work with the vast amount of data in today's world. Understanding the fundamental concepts and having frameworks for organizing data-analytic thinking not only allows one to interact competently, but also helps one envision opportunities for improving data-driven decision-making and recognize data-oriented competitive threats.
For these reasons, our goals in this challenge are not only to help the NGO find the best strategy for its campaign but also to build a general framework for the data analytics process.
We wanted a general-purpose, easy-to-use framework, so we decided that its key properties had to be:
Interactivity: the framework must allow interaction with the end user. We did not want a batch process; we wanted something where you can enter your questions and receive an immediate answer.
Visualization: when performing data analysis, the ability to visualize results is critical, so we wanted the framework to support easy and clear visualizations.
Ease of use: the learning curve was an important factor for us. We wanted something easy to understand and learn so that we could start working with it right away.
Reporting: the results of any analysis are worth nothing if no one reads them, so the framework should offer easy-to-use reporting and sharing capabilities.
Extensibility: we did not want a domain-specific framework that allows only one kind of analysis. We wanted something we could extend for other purposes and that gives us the freedom to choose between different tools.
After a careful analysis of the options, we decided to use Python with its extension modules IPython, pandas, Matplotlib, NumPy, and scikit-learn.
The IPython Notebook is the kernel of the framework. It is a web-based interactive computational environment where you can combine code execution, text, mathematics, plots, and rich media in a single document:
In [13]:
Image('/home/raul/Pictures/IPython.png', width=500, height=500)
Out[13]:
IPython Notebooks are normal files that can be shared with colleagues and converted to other formats such as HTML, PDF, or even slide shows like this one. Here is a short demo of the notebook’s basic features by the Pybonacci team:
In [2]:
HTML("""<iframe width="500" height="425"
src="http://www.youtube.com/embed/H6dLGQw9yFQ">
</iframe>""")
Out[2]:
Pandas is a Python package providing fast, flexible, and expressive data structures designed to work with relational or labeled data. It is a fundamental high-level building block for doing practical, real world data analysis in Python.
Pandas is well suited for:
Key features:
Matplotlib is the most popular Python library for producing plots and other 2D data visualizations. It integrates well with IPython, providing a comfortable interactive environment for plotting and exploring data. The plots are also interactive; you can zoom in on a section of the plot and pan around it using the toolbar in the plot window.
Some of the many advantages of this library include:
Here are some examples of the graphics we can create using Matplotlib:
In [23]:
HTML("<iframe src=http://matplotlib.org/gallery.html#lines_bars_and_markers width=800 height=350></iframe>")
Out[23]:
scikit-learn is an open source machine learning library for the Python programming language.
Some of the machine learning problems we can handle with scikit-learn are:
Here are some examples from scikit-learn:
In [24]:
HTML("<iframe src=http://scikit-learn.org/stable/auto_examples/index.html width=800 height=350></iframe>")
Out[24]:
The first step in any good knowledge discovery process is to clean the dataset. This is an important step because incorrect or inconsistent data can lead to false conclusions and misdirected actions. Our main goals in this phase were to fill in the missing values, detect the outliers, and remove unnecessary information.
To accomplish these goals, we used some of the built-in functions of the Python pandas module: we removed the statistically insignificant columns from the dataset, created some more descriptive new columns, and identified some of the most important outliers.
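The three cleaning goals can be sketched with pandas as follows. This is only an illustration under stated assumptions: the `donors` table, its column names, the median fill, the 1.5 × IQR outlier rule, and the "constant column" criterion are all hypothetical stand-ins, not the actual dataset or the exact rules we applied.

```python
import numpy as np
import pandas as pd

# Hypothetical donor records; column names and values are illustrative only.
donors = pd.DataFrame({
    "donor_id": [1, 2, 3, 4, 5, 6],
    "age": [34.0, np.nan, 51.0, 45.0, np.nan, 62.0],
    "last_gift": [10.0, 15.0, 12.0, 500.0, 8.0, 11.0],
    "internal_code": ["x"] * 6,  # constant column, carries no information
})

# 1. Complete missing values (assumption: fill with the column median).
donors["age"] = donors["age"].fillna(donors["age"].median())

# 2. Flag outliers (assumption: the common 1.5 * IQR rule).
q1, q3 = donors["last_gift"].quantile([0.25, 0.75])
iqr = q3 - q1
donors["gift_outlier"] = (
    (donors["last_gift"] < q1 - 1.5 * iqr)
    | (donors["last_gift"] > q3 + 1.5 * iqr)
)

# 3. Remove unnecessary information (assumption: columns with one distinct value).
constant_cols = [col for col in donors.columns if donors[col].nunique() <= 1]
donors = donors.drop(columns=constant_cols)
```

Flagging outliers in a separate column, rather than dropping the rows, keeps the decision about how to treat them open for the exploratory phase.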
Once the dataset is clean, we can continue with the next step, the exploratory phase. Here our focus was on detecting the key factors and fields that let us predict donation behavior.
This phase is quite important because the only way to develop intuition for what is going on in an unfamiliar dataset is to immerse yourself in it.
In this phase we made extensive use of visualizations. Our goal was to get to know the data: we examined some data distributions, validated some assumptions, and asked a lot of questions.
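A minimal sketch of this kind of exploration, assuming a synthetic `donors` table with hypothetical `age` and `donated` columns (the real dataset and the questions we asked of it are not reproduced here). It plots one distribution and checks one assumption, namely whether the response rate differs by age group.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; inside a notebook, %matplotlib inline is the usual choice
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
donors = pd.DataFrame({
    "age": rng.integers(18, 90, size=200),
    "donated": rng.random(200) < 0.2,
})

# Bin ages into groups (the bin edges are an arbitrary assumption).
donors["age_group"] = pd.cut(
    donors["age"],
    bins=[17, 35, 50, 65, 90],
    labels=["18-35", "36-50", "51-65", "66+"],
)

# Share of donors who responded in each age group.
response_rate = donors.groupby("age_group", observed=False)["donated"].mean()

# Left: distribution of ages. Right: response rate per age group.
fig, (ax_hist, ax_rate) = plt.subplots(1, 2, figsize=(9, 3))
ax_hist.hist(donors["age"], bins=15)
ax_hist.set_xlabel("age")
ax_hist.set_ylabel("number of donors")
response_rate.plot.bar(ax=ax_rate)
ax_rate.set_ylabel("response rate")
fig.tight_layout()
```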
Some of the insights and understandings we gained during this phase were:
With all the information and knowledge we gained from the exploratory phase, we were ready to start building a model to test our assumptions and try to predict the donor’s behavior.
We started with a single model. To build it, we created 7 segments from the insights we gained during the exploratory data analysis. These segments are:
Applying this single model to the dataset, we got a profit improvement of 50%.
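To show how a segment-based model can be turned into a profit comparison, here is a toy sketch. Everything in it is hypothetical: the three segments, the gift amounts, and the `MAIL_COST` value are invented for illustration and have no relation to our 7 segments or to the results reported above.

```python
import pandas as pd

MAIL_COST = 1.0  # hypothetical cost of sending one mailing

# Invented campaign history: one row per mailed donor.
history = pd.DataFrame({
    "segment": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "gift": [0.0, 12.0, 0.0, 0.0, 0.0, 1.5, 20.0, 0.0, 10.0],
})

# Expected profit per mailing in each segment: average gift minus mailing cost.
expected_profit = history.groupby("segment")["gift"].mean() - MAIL_COST

# Rule: mail only the segments whose expected profit is positive.
target_segments = expected_profit[expected_profit > 0].index.tolist()

# Baseline: mail everyone. Targeted: mail only the selected segments.
baseline_profit = (history["gift"] - MAIL_COST).sum()
targeted = history[history["segment"].isin(target_segments)]
targeted_profit = (targeted["gift"] - MAIL_COST).sum()

# Relative improvement of the targeted campaign over the baseline, in percent.
improvement_pct = 100 * (targeted_profit - baseline_profit) / baseline_profit
```

Note that this sketch selects and evaluates the segments on the same rows; a fair estimate of the improvement would evaluate the rule on data that was not used to choose the segments.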
We got great results with the single model, but we did not stop there. We then built a more complex model that uses the random forest machine learning algorithm to predict the results. We used almost the same variables as the selected features for the algorithm; with this new model, we got a profit improvement of 650%.
The report with our full analysis can be found here.
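As a rough sketch of what fitting a random forest with scikit-learn looks like, the example below trains a `RandomForestClassifier` on synthetic data. The three features, the way the target is generated, and all hyperparameters are assumptions made for illustration; this is not our actual model or feature set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_donors = 500

# Hypothetical features: age, number of past gifts, months since the last gift.
X = np.column_stack([
    rng.integers(18, 90, size=n_donors),
    rng.poisson(3, size=n_donors),
    rng.integers(1, 36, size=n_donors),
])

# Synthetic target: frequent and recent donors are made more likely to respond.
response_odds = 0.5 * X[:, 1] - 0.1 * X[:, 2] - 0.5
y = rng.random(n_donors) < 1 / (1 + np.exp(-response_odds))

# Hold out a quarter of the donors to evaluate the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Predicted probability of responding, usable to rank donors for a mailing.
response_prob = model.predict_proba(X_test)[:, 1]
accuracy = model.score(X_test, y_test)
```

Ranking donors by `response_prob` and mailing only those above a cost-based threshold is one way such a model could feed back into the profit calculation.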
In [5]:
Image('/home/raul/Pictures/IPython.png', width=500, height=500)
Out[5]: